Additional stuff: Presentation Dashboard
Our customer is "Callme", a platform that supplies various VoIP services (software, internet call channels, hardware, etc.); its ultimate clients are callcenters. Callcenters hire staff (operators) to make outgoing calls and receive incoming calls from subscribers (private individuals, companies, etc., depending on the business demand and circumstances). Each callcenter has its own ratio of incoming to outgoing calls according to its situation and requirements.
The customer wants to learn how to isolate inefficient operators using special algorithms that could later be sold to its clients (the callcenters) as a new service. We are required to analyze the available call data and develop such an algorithm.
Signs of operators' low efficiency that the customer suggests taking into account:
The datasets describe the use of the services of "Callme", a virtual telephony provider. Its clients are the callcenters that:
Operators can also make internal calls, i.e. calls between each other inside the virtual telephony network.
The columns in telecom_dataset.csv.zip:
- internal — Is the call an internal call between the client's operators
- is_missed_call — Is the call missed
- total_call_duration — Duration of the call (taking into account the waiting time)

The columns in telecom_clients.csv:
- tariff_plan — The current tariff plan of the client
# importing the libraries needed for work
import pandas as pd
from scipy import stats as st
import datetime as dt
import numpy as np
import math as mth
from matplotlib import pyplot as plt
import seaborn as sns
import os
import plotly.express as px
def file_path(name):
    """
    Searches for the data file (the lowercase file name is the argument)
    in Linux and Windows folders (when the notebook runs locally).
    Four candidate locations are tried, covering the Yandex server and
    a local Windows setup. The file should be located next to the notebook
    or in the nested "datasets" directory.
    """
    for p in ('/datasets', '', os.path.abspath(""), os.path.abspath("") + '\\datasets\\'):
        pth1 = os.path.join(p, name)
        if os.path.exists(pth1):
            break  # the search succeeded, so we leave the loop early
    else:
        pth1 = ''
        print(f"The {name} file wasn't found. Make sure that it",
              'is present in the directory next to the notebook or in a subfolder "/datasets"')
    print('Data reading from:', pth1, '\n')
    return pth1
name = 'telecom_dataset.csv' # the name of the file to be found and read
df = pd.read_csv(file_path(name)) # writing the data file to the dataframe
df.head(3) # let's take a look at the data
Data reading from: telecom_dataset.csv
|  | user_id | date | direction | internal | operator_id | is_missed_call | calls_count | call_duration | total_call_duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 166377 | 2019-08-04 00:00:00+03:00 | in | False | NaN | True | 2 | 0 | 4 |
| 1 | 166377 | 2019-08-05 00:00:00+03:00 | out | True | 880022.0 | True | 3 | 0 | 5 |
| 2 | 166377 | 2019-08-05 00:00:00+03:00 | out | True | 880020.0 | True | 1 | 0 | 1 |
name = 'telecom_clients.csv' # the name of the file to be found and read
df2 = pd.read_csv(file_path(name)) # writing the data file to the dataframe
df2.head(3) # let's take a look at the data
Data reading from: telecom_clients.csv
|  | user_id | tariff_plan | date_start |
|---|---|---|---|
| 0 | 166713 | A | 2019-08-15 |
| 1 | 166901 | A | 2019-08-23 |
| 2 | 168527 | A | 2019-10-29 |
len_df_ini = len(df) # rows in the dataframe before processing begins (we'll use it to control "losses")
len_df_ini
53902
def lost_data(df):
    """
    Reports the number and share of missing values per dataframe column.
    """
    out = pd.DataFrame(columns=['column_names', 'omissions', 'lost_portion'])
    i = 0
    for column in df.columns:
        if df[column].isna().sum() > 0:
            out.loc[i] = [column, df[column].isna().sum(),
                          round(df[column].isna().sum() / len(df), 4)]
            i += 1
    print(f"\nLost data in the columns: columns with omissions: {i} from {len(df.columns)}.")
    out.set_index('column_names', inplace=True)
    return out
df.drop_duplicates(inplace=True) # remove the explicit duplicates and
print(df.info()) # let's see what's with the losses in the first dataframe
print(lost_data(df).sort_values('lost_portion'))
print('\n---After removing explicit duplicates in the dataframe df',round((len_df_ini-len(df))*100/len(df),2),'% of rows are deleted.')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49002 entries, 0 to 53901
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 49002 non-null int64
1 date 49002 non-null object
2 direction 49002 non-null object
3 internal 48892 non-null object
4 operator_id 41546 non-null float64
5 is_missed_call 49002 non-null bool
6 calls_count 49002 non-null int64
7 call_duration 49002 non-null int64
8 total_call_duration 49002 non-null int64
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.4+ MB
None
Lost data in the columns: columns with omissions: 2 from 9.
omissions lost_portion
column_names
internal 110 0.0022
operator_id 7456 0.1522
---After removing explicit duplicates in the dataframe df 10.0 % of rows are deleted.
What can be seen from the first dataframe:
# let's check whether there are implicit duplicates or errors in these 3 columns
for col in ('direction', 'internal', 'is_missed_call'):
    print(f'Unique values in the column {col} -', df[col].unique())
Unique values in the column direction - ['in' 'out']
Unique values in the column internal - [False True nan]
Unique values in the column is_missed_call - [ True False]
df.dropna(subset=['internal'], inplace=True) # removing the gaps
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d') # changing type
df['internal'] = df['internal'].astype('bool') # changing type
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              48892 non-null  int64
 1   date                 48892 non-null  datetime64[ns, pytz.FixedOffset(180)]
 2   direction            48892 non-null  object
 3   internal             48892 non-null  bool
 4   operator_id          41491 non-null  float64
 5   is_missed_call       48892 non-null  bool
 6   calls_count          48892 non-null  int64
 7   call_duration        48892 non-null  int64
 8   total_call_duration  48892 non-null  int64
dtypes: bool(2), datetime64[ns, pytz.FixedOffset(180)](1), float64(1), int64(4), object(1)
memory usage: 3.1+ MB
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.
Let's move on to the question: operator_id has about 15% missing values (7456 rows). How should we handle that?
print('Number of zero-duration calls with "lost" operator_id (non-call):')
df[(df.operator_id.isna()) & (df.call_duration == 0)]['user_id'].count()
Number of zero-duration calls with "lost" operator_id (non-call):
7288
For these calls we will replace the missing (NaN) operator ID with an artificial ID 100000.0 (the ID of the "operator" for lost calls that had no operator).
(df.operator_id == 100000.0).any() # just in case, let's make sure that ID 100000.0 is not taken yet
False
# assign an artificial operator ID 100000.0 where operator = NaN and call length = 0
df['operator_id'] = df['operator_id'].where((df.operator_id.notna()) | (df.call_duration != 0), other=100000.0)
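A toy illustration (hypothetical values, not project data) of the `Series.where` call above: values are kept where the condition is True and replaced by `other` where it is False, so only rows with a missing ID *and* zero duration get the artificial ID.

```python
import numpy as np
import pandas as pd

# a toy frame mirroring the two columns used above
toy = pd.DataFrame({
    'operator_id': [901.0, np.nan, np.nan],
    'call_duration': [30, 0, 15],
})
# keep the ID where it is present or the call had non-zero duration;
# a missing ID with zero duration is replaced by the artificial 100000.0
toy['operator_id'] = toy['operator_id'].where(
    toy.operator_id.notna() | (toy.call_duration != 0), other=100000.0
)
print(toy['operator_id'].tolist())  # [901.0, 100000.0, nan]
```

Note that the third row keeps its NaN: a missing operator on a call of non-zero duration is a genuine data problem, handled separately below.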
Let's check calls of non-zero duration that carry the is_missed_call flag:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.
print('Calls flagged is_missed_call with non-zero duration:')
df[(df.is_missed_call == True) & (df.call_duration != 0)].head()
Calls flagged is_missed_call with non-zero duration:
|  | user_id | date | direction | internal | operator_id | is_missed_call | calls_count | call_duration | total_call_duration |
|---|---|---|---|---|---|---|---|---|---|
| 1606 | 166405 | 2019-11-19 00:00:00+03:00 | in | False | 939478.0 | True | 1 | 165 | 173 |
| 1635 | 166405 | 2019-11-21 00:00:00+03:00 | in | False | 882686.0 | True | 1 | 1 | 5 |
| 1657 | 166405 | 2019-11-22 00:00:00+03:00 | in | False | 882686.0 | True | 1 | 1 | 5 |
| 2283 | 166407 | 2019-09-23 00:00:00+03:00 | in | False | 888534.0 | True | 1 | 133 | 156 |
| 2869 | 166482 | 2019-11-05 00:00:00+03:00 | in | False | 934074.0 | True | 1 | 1 | 15 |
Since we have no explanation of how such calls can be marked as missed (a one-second call is not missed either), we will flip this flag to the opposite for all calls lasting more than 0 (most likely seconds, but the customer has not confirmed the unit).
df['is_missed_call'] = df['is_missed_call'].where((df.is_missed_call==False ) | (df.call_duration == 0), other=False)
df[(df.is_missed_call == True) & (df.call_duration != 0)].head() # one-second and longer calls are no longer listed as missed
| user_id | date | direction | internal | operator_id | is_missed_call | calls_count | call_duration | total_call_duration |
|---|---|---|---|---|---|---|---|---|
Let's see how many calls there are now without an operator ID:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.
df[(df.operator_id.isna())]['user_id'].count()
113
Very few, so let's delete them.
# deleting erroneous rows by indexes
df = df.drop(df[(df.operator_id.isna())].index)
print(lost_data(df).sort_values('lost_portion'))
Lost data in the columns: columns with omissions: 0 from 9.
Empty DataFrame
Columns: [omissions, lost_portion]
Index: []
Let's look at the average full call length: the maxima fit within 1 hour, which we consider normal:
(df['total_call_duration']/df['calls_count']).sort_values().tail(5)
31087    2329.0
8986     2618.0
34461    2904.0
8370     2907.5
14787    3609.0
dtype: float64
Now the total call length: the maxima are about 46 hours, which looks like a counter stuck over a weekend (the theoretical maximum is 48 hours). We will allow a maximum of 9 hours (a working day with a lunch break) and remove everything above it from the data.
df.total_call_duration.sort_values().tail(3) # maximums up to 46 hours
6808    141930
6950    146903
6946    166155
Name: total_call_duration, dtype: int64
# deleting erroneous rows by indexes
df = df.drop(df[(df.total_call_duration > (9*60*60))].index)
df.total_call_duration.sort_values().tail(3) # checking the result
30184    30864
7113     32148
6790     32291
Name: total_call_duration, dtype: int64
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.81 % of rows are deleted.
Conclusion to 1.2:
We will build histograms for the dataframe columns, excluding the direction, internal and is_missed_call fields. The graphs are built in Plotly; they are interactive and can be zoomed (the zoom function is very useful here).
fig = px.histogram(df, x="user_id")
fig.update_layout(title='Histogram of call center IDs',
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
fig = px.histogram(df, x="date")
fig.update_layout(title='Histogram by time',
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
fig = px.histogram(df, x="operator_id")
fig.update_layout(title="Histogram of operator IDs (the artificial ID 100000.0 is on the left)",
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
fig = px.histogram(df, x="calls_count")
fig.update_layout(title='Histogram of the number of calls',
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
fig = px.histogram(df, x="call_duration")
fig.update_layout(title='Histogram of the duration of calls (cumulative)',
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
fig = px.histogram(df, x="total_call_duration")
fig.update_layout(title='Histogram of the total duration of calls (cumulative)',
width=600, # specifying the size of the graph
height=300,
)
fig.show() # printing the graph
print('Rows in the dataset.')
len(df)
Rows in the dataset.
48645
print('Callcenters in the dataset.')
len(set(df.user_id))
Callcenters in the dataset.
307
print('Callcenter IDs in the dataset.')
df.user_id.value_counts() # let's take a look at how many times different callcenter IDs occur
Callcenter IDs in the dataset.
168187 2342
166658 1882
167497 1281
166916 1200
167626 1167
...
168578 1
168459 1
166554 1
166584 1
167981 1
Name: user_id, Length: 307, dtype: int64
print('Number of operators in the dataset (including the artificial ID 100000.0).')
len(set(df.operator_id))
Number of operators in the dataset (including the artificial ID 100000.0).
1093
print('Operator IDs in the dataset.')
df.operator_id.value_counts() # let's take a look at how many times different operator IDs occur
Operator IDs in the dataset.
100000.0 7286
901884.0 323
901880.0 300
893804.0 289
887276.0 267
...
960666.0 1
899906.0 1
899912.0 1
883018.0 1
937770.0 1
Name: operator_id, Length: 1093, dtype: int64
rr = df.groupby('date', as_index=False)['operator_id'].nunique()
fig = px.line(rr, x='date',y='operator_id')
fig.update_layout(title='Number of calling operators per day',
width=600, # specifying the size of the graph
height=400,
xaxis_title='Time',
yaxis_title='Quantity'
)
fig.show() # printing the graph
op = df.groupby('date', as_index=False)['operator_id'].nunique() # operators quantities
cc = df.groupby('date', as_index=False)['user_id'].nunique() # callcenters quantities
len(op) == len(cc) # are the lengths equal, aren't there gaps in the data?
True
# this column stores the daily number of operators divided by the daily number of callcenters
op['op_p_user'] = op['operator_id'] / cc['user_id']
fig = px.line(op, x='date',y='op_p_user')
fig.update_layout(title='The number of operators per 1 callcenter',
width=600, # specifying the size of the graph
height=400,
xaxis_title='Time',
yaxis_title='Quantity'
)
fig.show() # printing the graph
rr = df.groupby('date', as_index=False)['calls_count'].sum()
fig = px.line(rr, x='date',y='calls_count')
fig.update_layout(title='Calls per day',
width=600, # specifying the size of the graph
height=400,
xaxis_title='Time',
yaxis_title='Quantity'
)
fig.show() # printing the graph
rr = df.groupby('date', as_index=False)['user_id'].nunique()
fig = px.line(rr, x='date',y='user_id')
fig.update_layout(title='Active callcenters per day',
width=600, # specifying the size of the graph
height=400,
xaxis_title='Time',
yaxis_title='Quantity'
)
fig.show() # printing the graph
A growing business brings certain difficulties in accounting for both callcenters and operators. The number of callcenters gradually increases in the first half of the period. It is also possible that some customers stopped cooperating with "Callme" during the period. An obvious solution suggests itself: cut off the first third or half of the period.
Regarding callcenters. The analytics in section 3 groups the data for each callcenter separately; time boundaries are taken from the filtered xx.date, not from the original df.date, so the start of the period is handled correctly. At the right border of the time interval the code has been corrected: it now checks whether the client has been active in the last d_count days (a relevance test for the client). For details, see the function for selecting inefficient operators.
Regarding operators. We do not run a relevance test for operators (they go on vacation and come back; it is hard to tell). If the number of operators in a callcenter fluctuates, we do nothing. If the business is dying and the number of operators is shrinking, we do nothing. But if the second half of the time period has twice as many operators as the first, the callcenter is growing rapidly and changing a lot, so we cut off the older half of the period. For details, see the function for selecting inefficient operators. (Across the entire dataframe, the ratio is 461 to 997.)
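The rapid-growth test above can be sketched on a toy activity log (made-up dates and IDs): one operator in the older half of the period, three new ones in the newer half, so only the newer half is kept.

```python
import pandas as pd

# toy activity log: 1 operator in the first half, 3 new ones in the second
xx = pd.DataFrame({
    'date': pd.to_datetime(['2019-08-01', '2019-08-02',
                            '2019-09-01', '2019-09-02', '2019-09-03']),
    'operator_id': [1, 1, 2, 3, 4],
})
half = (xx.date.max() - xx.date.min()) / 2    # midpoint of the observed period
second = xx[xx.date >= xx.date.min() + half]  # newer half
first = xx[xx.date < xx.date.min() + half]    # older half
# more than twice as many distinct operators in the newer half -> rapid growth
if second['operator_id'].nunique() > 2 * first['operator_id'].nunique():
    xx = second  # keep only the newer half of the period
print(len(xx))  # 3
```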
print('The beginning and end of the time interval in the data.')
df.date.min(), df.date.max()
The beginning and end of the time interval in the data.
(Timestamp('2019-08-02 00:00:00+0300', tz='pytz.FixedOffset(180)'),
Timestamp('2019-11-28 00:00:00+0300', tz='pytz.FixedOffset(180)'))
print('The numbers of external and internal calls:')
df.internal.value_counts()
The numbers of external and internal calls:
False    43005
True      5640
Name: internal, dtype: int64
print('The numbers of outgoing and incoming calls:')
df.direction.value_counts()
The numbers of outgoing and incoming calls:
out    28821
in     19824
Name: direction, dtype: int64
print('The distribution of the calls_count values.')
df.calls_count.value_counts()
The distribution of the calls_count values.
1 12933
2 6366
3 4047
4 3028
5 2224
...
380 1
514 1
540 1
645 1
748 1
Name: calls_count, Length: 415, dtype: int64
print('Missed calls (number of rows).')
df.is_missed_call.sum()
Missed calls (number of rows).
21097
df2.drop_duplicates(inplace=True) # remove the explicit duplicates and
print(df2.info()) # let's see what's with the losses in the second dataframe
print(lost_data(df2).sort_values('lost_portion'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   user_id      732 non-null    int64
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 22.9+ KB
None

Lost data in the columns: columns with omissions: 0 from 3.
Empty DataFrame
Columns: [omissions, lost_portion]
Index: []
df2['date_start'] = pd.to_datetime(df2['date_start'], format='%Y-%m-%d') # changing type
df2.info() # checking the result
<class 'pandas.core.frame.DataFrame'>
Int64Index: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   user_id      732 non-null    int64
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 22.9+ KB
print('For reference. Distribution of tariffs.')
df2.tariff_plan.value_counts()
For reference. Distribution of tariffs.
C    395
B    261
A     76
Name: tariff_plan, dtype: int64
Conclusion to 1.3:
The project may be of interest to those callcenters where a large number of operators work and it is impossible to carry out quality control over each of them. Let's see how the number of employees is distributed across callcenters (see section 1.3 for a graph of average values).
yy = df.groupby('user_id', as_index=False)['operator_id'].nunique().sort_values(by='operator_id')['operator_id']
yy.plot(kind='hist', figsize=(5, 4), bins=50)
plt.xticks(rotation = 0)
plt.title('Number of operators by callcenters')
plt.ylabel('Callcenters number')
plt.xlabel('Operators number')
plt.show();
n = 5 # after this number, we believe that there are quite a lot of operators working in the callcenter
print(f'Portion of callcenters, where operators number > {n}, {round(100 * yy.where(yy>n).notna().sum()/len(yy))}%')
Portion of callcenters, where operators number > 5, 20%
Let's look at how the information is collected in the dataframe provided by the client. Filter by one of the dates and operator ID.
df[(df.date == '2019-10-28') & (df.operator_id == 948182.0)]
|  | user_id | date | direction | internal | operator_id | is_missed_call | calls_count | call_duration | total_call_duration |
|---|---|---|---|---|---|---|---|---|---|
| 53451 | 168416 | 2019-10-28 00:00:00+03:00 | out | False | 948182.0 | False | 12 | 539 | 846 |
| 53452 | 168416 | 2019-10-28 00:00:00+03:00 | in | False | 948182.0 | False | 1 | 97 | 104 |
| 53454 | 168416 | 2019-10-28 00:00:00+03:00 | out | False | 948182.0 | True | 5 | 0 | 98 |
As we can see, the data is already aggregated: for each combination of date, direction, internal and operator_id we only have the number of calls (calls_count) and their total durations (call_duration, total_call_duration). The request to the customer is to provide non-aggregated data, since it can be processed in many more ways.
In "Tableau" terminology, we have dimension columns (date, direction, internal, operator_id, is_missed_call), by which the data can be filtered and selected, and measure columns (calls_count, call_duration, total_call_duration), from which numerical metrics can be computed.
Introductory remarks.
What counts as "efficient" and "inefficient" work of operators inside their callcenters? Callcenters differ, their business tasks differ, and over time their profile (call duration, the ratio of incoming to outgoing calls, etc.) may change. Therefore, metrics imposed from the outside may or may not work for each particular callcenter.
Let's take as an axiom (a statement requiring no proof) that most operators want to, strive to, and do work well (i.e. efficiently). Then the benchmark (target) for each callcenter within a certain time interval is its average indicators (metrics), since they are typical of most operators. Anomalies can then be treated as a sign of inefficient work: for example, an operator may talk too little or too long with clients, and both count as a "deviation from the effective average". The final decision, of course, belongs to the client; our analysis only highlights the deviations.
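A minimal sketch of this "deviation from the effective average" idea, on made-up per-operator call counts (not project data): operators outside the segment mean ± k·std are flagged.

```python
import numpy as np

# toy per-operator metric: four typical operators and one outlier
calls_per_operator = {'op1': 50, 'op2': 48, 'op3': 52, 'op4': 5, 'op5': 51}
values = np.array(list(calls_per_operator.values()), dtype=float)
k = 1.0                                    # the global multiplier from the text
mean, dlt = values.mean(), k * np.std(values)
# an operator is "anomalous" when outside [mean - dlt, mean + dlt]
anomalous = [op for op, v in calls_per_operator.items()
             if v < mean - dlt or v > mean + dlt]
print(anomalous)  # ['op4']
```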
The operation of the algorithm.
There can be three situations: the studied criterion (the research column) has inefficient operators on the left (in the area of small values), on the right (in the area of large values), or on both sides beyond a segment centered on the average, whose boundaries lie at plus or minus the standard deviation multiplied by a global multiplier k (for example 1.0). Where to search for anomalies (left and/or right) is set by the parameters l and r.
If anomalies are sought on both sides, the target (desired, ideal) average is calculated over the entire time range. For one-sided searches (left only or right only) the algorithm is more complex: we have to find the best indicator in the past and set it as the model (target), relative to which efficiency or inefficiency is determined.
If anomalies are sought only on the left, the time is divided into segments of d_count days (a global parameter, the same for all tests); on each interval its average (srd) is calculated, and the largest of these averages becomes the target. Relative to this target average, k standard deviations (calculated over the entire time span of the dataframe) are laid off to the left, and all operators beyond this boundary are marked as anomalous. For anomalies sought only on the right, the process is mirrored.
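The windowed-target step can be sketched with toy daily values (hypothetical numbers): split the period into d_count-day windows, take each window's mean, and use the best one as the target.

```python
import pandas as pd

d_count = 10  # the global window width in days, as in the text
# toy daily values of the research metric: three 10-day windows
daily = pd.Series(
    [20.0] * 10 + [35.0] * 10 + [25.0] * 10,
    index=pd.date_range('2019-08-01', periods=30, freq='D'),
)
window_means = daily.resample(f'{d_count}D').mean()  # mean per d_count-day window
target = window_means.max()  # the best window's average becomes the target
print(target)  # 35.0
```

For a right-side search the best window is the one with the *smallest* mean, i.e. `window_means.min()` takes the place of the maximum.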
About calculated metrics.
We can choose different metrics: primary (dataframe fields) and secondary (their mathematical combinations). The more columns in the data and the fewer aggregations, the more metrics are available. To begin with (see the next section), we will construct the three the client suggested and two more for analyzing the time of incoming and outgoing calls.
Which metrics can be considered (the list is not complete, because one can average, search for maximums and minimums, divide, etc.):
# variables for all studies
k = 1.0       # "good" operators are those within the average +- k standard deviations
d_count = 10  # how many days are in each interval into which we divide the dataframe
def report(dframe):  # report generation function
    c = dframe.callcenter_id.nunique()
    o = dframe.operator_id.nunique()
    print(f'\n{dframe.name}.\nFound {c} callcenters, having {o} "anomalous" operators.')
def find_uneff(l, r, dlt, xx):  # the function for selecting inefficient operators
    """
    l   - boolean flag: look for inefficient operators on the left, among small values
    r   - boolean flag: look for inefficient operators on the right, among large values
    dlt - (k * np.std) the allowed spread around the average; operators beyond it are anomalous
    xx  - the raw dataframe to search in; the callcenter is fixed when the function is called
    """
    # if the callcenter was active within the last d_count days, we count it as active, else return []
    if df.date.max() - xx.date.max() < dt.timedelta(days=d_count):
        # if the second half of the time interval has more than twice as many operators
        # as the first, the callcenter is growing rapidly, with a lot of changes,
        # so we cut off the older (left) part of the period
        half = (xx.date.max() - xx.date.min()) / 2
        if (xx[xx.date >= xx.date.min() + half]['operator_id'].nunique() >
                2 * xx[xx.date < xx.date.min() + half]['operator_id'].nunique()):
            xx = xx[xx.date >= xx.date.min() + half]  # the older part of the period is cut off
        if l & r:
            srd = np.mean(xx.research)  # the aggregation (here mean) must be the same everywhere
            xx = xx[['date', 'operator_id', 'research']].groupby('operator_id', as_index=False)['research'].mean()
            gr = [xx.loc[i]['operator_id'] for i in range(len(xx)) if xx.loc[i]['research'] > srd + dlt]
            gl = [xx.loc[i]['operator_id'] for i in range(len(xx)) if xx.loc[i]['research'] < srd - dlt]
            return gr + gl
        elif l:  # the target average is the maximum over all the time intervals
            # note: xx.date.min() is the start date by which the data was filtered in the caller
            ld = xx.date.min()
            rd = xx.date.min() + dt.timedelta(days=d_count)
            srd = -1000000000  # just a very small number
            while rd < xx.date.max():
                srd = max(srd, xx.loc[(xx.date >= ld) & (xx.date < rd)]['research'].mean())
                ld = rd
                rd += dt.timedelta(days=d_count)
            xx = xx[['date', 'operator_id', 'research']].groupby('operator_id', as_index=False)['research'].mean()
            gl = [xx.loc[i]['operator_id'] for i in range(len(xx)) if xx.loc[i]['research'] < srd - dlt]
            return gl
        elif r:  # the target average is the minimum over all the time intervals
            ld = xx.date.min()
            rd = xx.date.min() + dt.timedelta(days=d_count)
            srd = 1000000000  # just a very large number
            while rd < xx.date.max():
                srd = min(srd, xx.loc[(xx.date >= ld) & (xx.date < rd)]['research'].mean())
                ld = rd
                rd += dt.timedelta(days=d_count)
            xx = xx[['date', 'operator_id', 'research']].groupby('operator_id', as_index=False)['research'].mean()
            gr = [xx.loc[i]['operator_id'] for i in range(len(xx)) if xx.loc[i]['research'] > srd + dlt]
            return gr
    return []  # neither flag is set, or the callcenter is inactive
For each callcenter (user_id), we build its own distribution, filtering by direction (in), internal (False), is_missed_call (True), and ranking operators by the number of calls (calls_count).
name = 'Research 1: a lot of missed incoming calls' ###### change each time
# collect anomalies on the left: l=True, collect anomalies on the right: r=True
l = False ###### collect inefficient ones from below, change each time
r = True ###### collect inefficient ones from above, change each time
research = pd.DataFrame([], columns=['callcenter_id', 'operator_id']) # temporary for results
for callcenter in df.user_id.unique():
    option = ((df.direction == 'in') ###### change each time
              & (df.is_missed_call == True) ###### change each time
              & (df.internal == False)
              & (df.operator_id != 100000.0)
              & (df.user_id == callcenter)
              )
    xx = df[option].groupby(['date', 'operator_id'], as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['calls_count'] ###### making a research column
    dlt = k * np.std(xx.research) # the deviation is calculated over the entire time interval of the dataframe
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    if anomalies:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]] # add a row
            research.index = research.index + 1 # shift the index
            research = research.sort_index() # sort by index
research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')
research_01 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_01)
research_01.name = name ###### assign a name to the dataframe, we might add other service information
report(research_01) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.
callcenter_id operator_id
0 168187 948286.0
1 168187 937956.0
2 167497 924936.0
3 167466 921814.0
4 166916 906410.0
5 166541 908960.0
6 166405 882686.0
Research 1: a lot of missed incoming calls.
Found 6 callcenters, having 7 "anomalous" operators.
print('Missed calls with operator IDs other than 100000.0:')
len(df[(df.is_missed_call==True)&(df.operator_id!=100000.0)])
Missed calls with operator IDs other than 100000.0:
13811
print('Missed calls with the artificial operator ID 100000.0 (lost):')
len(df[(df.is_missed_call==True)&(df.operator_id==100000.0)])
Missed calls with the artificial operator ID 100000.0 (lost):
7286
For each callcenter (user_id), we build its own distribution, filtering by direction (in), internal (False), and ranking operators by the difference between the sums of total_call_duration and call_duration divided by the number of calls calls_count (i.e. by the average waiting time).
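A quick numeric check of this waiting-time metric on made-up numbers: waiting time is total duration minus talk time, averaged over the number of calls.

```python
# hypothetical daily aggregates for one operator (seconds)
total_call_duration = 140  # talk time plus waiting
call_duration = 100        # talk time only
calls_count = 4
# average waiting time per call, rounded as in the research column below
avg_wait = round((total_call_duration - call_duration) / calls_count)
print(avg_wait)  # 10
```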
name = 'Research 2: long waiting for a response on incoming external calls' ###### change each time
# collect anomalies on the left: l=True, collect anomalies on the right: r=True
l = False ###### collect inefficient ones from below, change each time
r = True ###### collect inefficient ones from above, change each time
research = pd.DataFrame([], columns=['callcenter_id', 'operator_id']) # temporary for results
for callcenter in df.user_id.unique():
    option = ((df.direction == 'in') ###### change each time
              & (df.is_missed_call == False) ###### change each time
              & (df.internal == False)
              & (df.operator_id != 100000.0)
              & (df.user_id == callcenter)
              )
    xx = df[option].groupby(['date', 'operator_id'], as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = round((xx['total_call_duration'] - xx['call_duration']) / xx['calls_count']) ###### making a research column
    dlt = k * np.std(xx.research) # the deviation is calculated over the entire time interval of the dataframe
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    if anomalies:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]] # add a row
            research.index = research.index + 1 # shift the index
            research = research.sort_index() # sort by index
research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')
research_02 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_02)
research_02.name = name ###### assign a name to the dataframe
report(research_02) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.
callcenter_id operator_id
0 168601 952916.0
1 168579 973286.0
2 168412 952462.0
3 168336 969380.0
4 168336 958416.0
.. ... ...
139 166405 902510.0
140 166405 882690.0
141 166405 882688.0
142 166392 891900.0
143 166377 880022.0
[144 rows x 2 columns]
Research 2: long waiting for a response on incoming external calls.
Found 90 callcenters, having 144 "anomalous" operators.
For each callcenter (user_id), we build its own distribution, filtering by direction (out), internal (False), and ranking operators by the number of calls (calls_count).
Remark. Why didn't I filter the call centers by specialization, despite the task:
name = 'Research 3: few outgoing calls' ###### change each time
# collect anomalies on the left: l=True, collect anomalies on the right: r=True
l = True ###### collect inefficient ones from below, change each time
r = False ###### collect inefficient ones from above, change each time
research = pd.DataFrame([], columns=['callcenter_id', 'operator_id']) # temporary for results
for callcenter in df.user_id.unique():
    option = ((df.direction == 'out') ###### change each time
              & (df.is_missed_call == False) ###### change each time
              & (df.internal == False)
              & (df.operator_id != 100000.0)
              & (df.user_id == callcenter)
              )
    xx = df[option].groupby(['date', 'operator_id'], as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['calls_count'] ###### making a research column
    dlt = k * np.std(xx.research) # the deviation is calculated over the entire time interval of the dataframe
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    if anomalies:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]] # add a row
            research.index = research.index + 1 # shift the index
            research = research.sort_index() # sort by index
research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')
research_03 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_03)
research_03.name = name ###### assign a name to the dataframe
report(research_03) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.
callcenter_id operator_id
0 168601 952916.0
1 168412 953460.0
2 168361 965538.0
3 168361 945314.0
4 168361 945274.0
.. ... ...
145 166405 882688.0
146 166405 882684.0
147 166377 881278.0
148 166377 880022.0
149 166377 880020.0
[150 rows x 2 columns]
Research 3: few outgoing calls.
Found 67 callcenters, having 150 "anomalous" operators.
For each callcenter (user_id), we build its own distribution, filtering by direction (in) and internal (False), and ranking operators by the average call duration (call_duration), obtained by dividing the total talk time by the total number of calls.
name = 'Research 4: conversations are too short or too long for incoming external calls'  ###### change each time
# collect anomalies on the left: l=True, collect anomalies on the right: r=True
l = True  ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time
research = pd.DataFrame([], columns=['callcenter_id', 'operator_id'])  # temporary for results
for callcenter in df.user_id.unique():
    option = ((df.direction == 'in')  ###### change each time
              & (df.is_missed_call == False)  ###### change each time
              & (df.internal == False)
              & (df.operator_id != 100000.0)
              & (df.user_id == callcenter)
              )
    xx = df[option].groupby(['date', 'operator_id'], as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['call_duration'] / xx['calls_count']  ###### making a research column
    dlt = k * np.std(xx.research)  # deviation is calculated over the entire time interval from the dataframe
    anomalies = find_uneff(l, r, dlt, xx)  # getting a list of abnormal operator IDs
    if anomalies is not None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index
research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')
research_04 = research.copy()  ###### creating a copy of the dataframe with the results
print('\n', research_04.tail(10))
research_04.name = name  ###### assign a name to the dataframe
report(research_04)  ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.
callcenter_id operator_id
34 166916 906416.0
35 166916 906394.0
36 166899 894662.0
37 166604 893400.0
38 166536 900192.0
39 166536 884524.0
40 166520 923254.0
41 166407 888538.0
42 166405 939478.0
43 166392 886892.0
Research 4: conversations are too short or too long for incoming external calls.
Found 31 callcenters, having 44 "anomalous" operators.
For each callcenter (user_id), we build its own distribution, filtering by direction (out) and internal (False), and ranking operators by the average call duration (call_duration), obtained by dividing the total talk time by the total number of calls.
name = 'Research 5: conversations are too short or too long for outgoing external calls'  ###### change each time
# collect anomalies on the left: l=True, collect anomalies on the right: r=True
l = True  ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time
research = pd.DataFrame([], columns=['callcenter_id', 'operator_id'])  # temporary for results
for callcenter in df.user_id.unique():
    option = ((df.direction == 'out')  ###### change each time
              & (df.is_missed_call == False)  ###### change each time
              & (df.internal == False)
              & (df.operator_id != 100000.0)
              & (df.user_id == callcenter)
              )
    xx = df[option].groupby(['date', 'operator_id'], as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['call_duration'] / xx['calls_count']  ###### making a research column
    dlt = k * np.std(xx.research)  # deviation is calculated over the entire time interval from the dataframe
    anomalies = find_uneff(l, r, dlt, xx)  # getting a list of abnormal operator IDs
    if anomalies is not None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index
research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')
research_05 = research.copy()  ###### creating a copy of the dataframe with the results
print('\n', research_05.tail(10))
research_05.name = name  ###### assign a name to the dataframe
report(research_05)  ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.
callcenter_id operator_id
46 166800 892534.0
47 166680 950444.0
48 166658 891154.0
49 166520 969600.0
50 166520 910926.0
51 166520 910902.0
52 166520 909452.0
53 166520 891192.0
54 166520 890232.0
55 166485 887280.0
Research 5: conversations are too short or too long for outgoing external calls.
Found 37 callcenters, having 56 "anomalous" operators.
Remark:
So far, only the IDs of "abnormal" operators are included in the output. In a commercial version, the output dataframe should carry more complete data to help callcenters understand why each operator did (or did not) end up in the set of inefficient ones.
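A minimal sketch of what such an enriched report might look like. The helper `enrich` and all column names besides `callcenter_id`/`operator_id`/`research` are my own assumptions, and the data is a toy example, not the project dataset: for each flagged operator it reports the metric value that triggered the flag, the callcenter's mean, and the deviation in standard deviations.

```python
import pandas as pd

# Hypothetical enrichment (names are illustrative, not from the project code):
# alongside each flagged operator, report the metric value, the callcenter's
# mean, and the deviation in sigmas, so the client sees WHY the flag was raised.
def enrich(research, metrics):
    """research: callcenter_id/operator_id pairs; metrics: per-day metric values."""
    stats = (metrics.groupby('callcenter_id')['research']
             .agg(['mean', 'std']).reset_index())
    per_op = (metrics.groupby(['callcenter_id', 'operator_id'], as_index=False)
              ['research'].mean())
    out = research.merge(per_op, on=['callcenter_id', 'operator_id'], how='left')
    out = out.merge(stats, on='callcenter_id', how='left')
    out['sigmas'] = (out['research'] - out['mean']) / out['std']
    return out

# toy data: operator 20 of callcenter 1 waits far longer than its peer
metrics = pd.DataFrame({
    'callcenter_id': [1, 1, 1, 1],
    'operator_id':   [10, 10, 20, 20],
    'research':      [5.0, 7.0, 40.0, 44.0],  # e.g. waiting seconds per call
})
research = pd.DataFrame({'callcenter_id': [1], 'operator_id': [20]})
print(enrich(research, metrics))
```

With such columns a callcenter can rank its flagged operators by `sigmas` instead of receiving a bare ID list.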
c = len(set(research_01.callcenter_id) | set(research_02.callcenter_id) |
        set(research_03.callcenter_id) | set(research_04.callcenter_id) | set(research_05.callcenter_id))
cp = round(100 * c / df.user_id.nunique())
o = len(set(research_01.operator_id) | set(research_02.operator_id) |
        set(research_03.operator_id) | set(research_04.operator_id) | set(research_05.operator_id))
op = round(100 * o / df.operator_id.nunique())
print(f'In five studies, including intersections, found {c} ({cp}%) callcenters, having {o} ({op}%) "abnormal" operators.')
In five studies, including intersections, found 117 (38%) callcenters, having 314 (29%) "abnormal" operators.
Conclusion to 3:
Over the last 4 weeks, the number of callcenters active each day has fluctuated around the same level. Dips are visible on weekends, and the maxima are at roughly the same height (see clause 1.3).
Let's select two consecutive intervals of two weeks each, running from Monday through Sunday:
sample_1 = df.loc[(df.date >= '2019.10.27') & (df.date < '2019.11.10')].sort_values(by='date')
sample_2 = df.loc[(df.date >= '2019.11.10') & (df.date < '2019.11.24')].sort_values(by='date')
# demonstrate this on the graph
rr = df.groupby('date', as_index=False)['user_id'].nunique() # number of callcenters per day
rr.plot(kind='line', x='date', y='user_id', figsize=(5, 3))
plt.axvline('2019.11.10', color='red', alpha = 0.5)
plt.axvline('2019.11.24', color='red', alpha = 0.5)
plt.axvline('2019.10.27', color='red', alpha = 0.5)
plt.xticks(rotation=0)
plt.title('Number of callcenters per date')
plt.ylabel('Number of callcenters')
plt.xlabel('Date')
plt.legend([],[], frameon=False)
plt.show();
# creating 2 series with the numbers of callcenters in these time periods
r1 = sample_1.groupby('date', as_index=False)['user_id'].nunique().user_id
r2 = sample_2.groupby('date', as_index=False)['user_id'].nunique().user_id
# compare groups 1 and 2 with the Mann-Whitney U test
alpha = 0.05
results = st.mannwhitneyu(r1, r2, use_continuity=True, alternative='less')
print('Wilcoxon-Mann-Whitney test p-value: ', results.pvalue)
if results.pvalue < alpha:
    print('Rejecting the null hypothesis: the difference is statistically significant.')
else:
    print('It was not possible to reject the null hypothesis, the difference is NOT statistically significant.')
Wilcoxon-Mann-Whitney test p-value:  0.4090863915340001
It was not possible to reject the null hypothesis, the difference is NOT statistically significant.
# the graphs are similar, and Mann-Whitney also considers the sequences similar, so the variances should be quite close
# run the t-test for independent samples
results = st.ttest_ind(r1, r2, equal_var=True)  # consider the variances to be close
alpha = 0.05  # significance level
print(results.pvalue)
if results.pvalue < alpha:
    print('Rejecting the null hypothesis.')
else:
    print('Not possible to reject the null hypothesis.')
0.5852169029492182
Not possible to reject the null hypothesis.
alpha = 0.05 # critical level of statistical significance
successes = np.array([sample_1.is_missed_call.sum(), sample_2.is_missed_call.sum()]) # missed calls
trials = np.array([sample_1.is_missed_call.count(), sample_2.is_missed_call.count()]) # calls in total
p1 = successes[0]/trials[0] # proportion of successes in the first group
p2 = successes[1]/trials[1] # proportion of successes in the second group
# proportion of successes in the combined dataset:
p_combined = (successes[0] + successes[1]) / (trials[0] + trials[1])
difference = p1 - p2 # difference in proportions in datasets
# statistics in the standard deviations of the standard normal distribution
z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))
# setting the standard normal distribution (mean 0, standard deviation 1)
distr = st.norm(0, 1)
p_value = (1 - distr.cdf(abs(z_value))) * 2
print('p-value: ', p_value)
if p_value < alpha:
    print('Rejecting the null hypothesis: there is a significant difference between the proportions.')
else:
    print('Not possible to reject the null hypothesis: there is no reason to consider the proportions different.')
p-value:  0.01777018753871218
Rejecting the null hypothesis: there is a significant difference between the proportions.
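The pooled two-proportion z-test computed step by step above can be wrapped in a reusable helper. A minimal sketch with toy numbers (the function name `z_test_proportions` and the 30/1000 vs 50/1000 counts are my own, not from the project data):

```python
import math
from scipy import stats as st

def z_test_proportions(successes, trials):
    """Two-sided pooled z-test for the difference of two proportions.
    successes = (s1, s2), trials = (n1, n2)."""
    p1 = successes[0] / trials[0]
    p2 = successes[1] / trials[1]
    p_combined = sum(successes) / sum(trials)  # pooled proportion
    se = math.sqrt(p_combined * (1 - p_combined) * (1 / trials[0] + 1 / trials[1]))
    z = (p1 - p2) / se                          # z-statistic
    p_value = 2 * (1 - st.norm.cdf(abs(z)))     # two-sided p-value
    return z, p_value

# toy example: 30/1000 vs 50/1000 missed calls
z, p = z_test_proportions([30, 50], [1000, 1000])
print(round(z, 3), round(p, 4))
```

Such a helper would let the proportion comparison be rerun per callcenter or per metric without duplicating the formula.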
Conclusion to 4:
Analysis of inefficient operators:
Statistical research:
Recommendations for the client: